Duke Statistical Science | Graduation with Distinction
April 11, 2023
dsbox, an introductory data science tutorial package

| Passage | Learning Objective(s) |
|---|---|
| Storm Paths | modeling; simulation; uncertainty |
| Movie Budgets 1 | compare summary statistics visually |
| Movie Budgets 2 | modeling; \(R^2\); compare trends visually |
| Application Screening | ethics; modeling; proxy variable |
| Banana Conclusions | causation; statistical communication |
| COVID Map | complex visualization; spatial data; time series; sophisticated scales |
| He Said She Said | basic visualization; sophisticated scales |
| Build-a-Plot | data to visualization process |
| Disease Screening | compare classification diagnostics visually |
| Realty Tree | modeling; regression tree; variable selection |
| Website Testing | compare trends visually; uncertainty; modeling; time series; extrapolation |
| Image Recognition | ethics; modeling; representativeness of training data |
| Data Confidentiality | ethics; data deidentification; statistical communication |
| Activity Journal | structure data; store data |
| Movie Wrangling | data cleaning; data wrangling; column-wise string operations; pseudocode; joins |
You are working on a team that is making a deterministic model to quickly screen through applications for a new position at the company. Based on employment laws, your model may not include variables such as age, race, and gender, which could be potentially discriminatory.
Your colleague suggests including a rule that eliminates candidates with more than 20 years of previous work experience, because they may have high salary expectations. Why might using this variable be considered unethical? Explain your answer.
You are working on a team that is making a deterministic model to quickly screen through applications for a new position at the company. Based on employment laws, your model may not include variables such as age, race, and gender, which could be potentially discriminatory.
Your colleague suggests including a rule that eliminates candidates with more than 20 years of previous work experience, because they may have high salary expectations. Are there ethical implications of using this variable to select candidates? Explain your answer.
A newspaper reports on the results of a survey from a small (<2000 student) college. The college agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is less likely to unintentionally identify any students? Explain.
a. Year, major, sports played
b. Year, major
A newspaper reports on the results of a survey from a small (<2000 student) university. The university agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is less likely to unintentionally identify any students? Explain.
a. Year, major, sports played
b. Year, major
A newspaper reports on the results of a survey from a small (<2000 student) university. The university agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is less likely to unintentionally identify any students? Explain.
a. Class year and sports played
b. Student ID and dorm zip code
c. GPA and major
d. Birth date and phone number
e. None of the above
A data scientist at IMDb has been given a dataset containing the revenues and budgets for 2,349 movies made between 1986 and 2016.
Suppose they want to compare several distributional features of the budgets among four different genres—Horror, Drama, Action, and Animation. To do this, they create the following plots.
Fill in the following table by placing a check mark in the cells corresponding to the attributes of the data that can be determined by examining each of the plots.
| | Plot A | Plot B | Plot C | Plot D |
|---|---|---|---|---|
| Mean | ☐ | ☐ | ☐ | ☐ |
| Median | ☐ | ☐ | ☐ | ☐ |
| IQR | ☐ | ☐ | ☐ | ☐ |
| Shape | ☐ | ☐ | ☐ | ☐ |
The table below provides data about 10 movies released in the United States. For each movie, it gives the title, the director, the release date, the season of release, the worldwide gross intake in U.S. dollars, a cleaned numeric version of the worldwide gross intake (in millions of U.S. dollars), and whether or not the movie won the Best Picture Oscar.
| title | director | release_date | season | gross | gross_clean | best_picture |
|---|---|---|---|---|---|---|
| Almost Famous | Cameron Crowe | 22 September 2000 | Fall | $47.39M | 47.39 | No |
| CODA | Sian Heder | 13 August 2021 | Summer | $1.61M | 1.61 | Yes |
| E.T. the Extra-Terrestrial | Steven Spielberg | 11 June 1982 | Summer | $792.91M | 792.91 | No |
| Luca | Enrico Casarosa | 18 June 2021 | Summer | $49.75M | 49.75 | No |
| Middle of Nowhere | Ava DuVernay | 1 September 2014 | Fall | $0.24M | 0.24 | No |
| Moonlight | Barry Jenkins | 18 November 2016 | Fall | $65.34M | 65.34 | Yes |
| Parasite | Bong Joon Ho | 8 November 2019 | Fall | $262.69M | 262.69 | Yes |
| Say Anything | Cameron Crowe | 14 April 1989 | Spring | $21.52M | 21.52 | No |
| Selma | Ava DuVernay | 9 January 2015 | Winter | $66.79M | 66.79 | No |
| We Bought a Zoo | Cameron Crowe | 23 December 2011 | Winter | $120.08M | 120.08 | No |
The table below provides data about 10 movie directors. For each director, it gives the director’s name, the number of Oscar nominations the director has received, and the number of Oscars the director has won.
| director | nominations | oscars |
|---|---|---|
| Ava DuVernay | 1 | 0 |
| Barry Jenkins | 3 | 1 |
| Bong Joon Ho | 3 | 3 |
| Cameron Crowe | 3 | 1 |
| Enrico Casarosa | 2 | 0 |
| Loveleen Tandan | 0 | 0 |
| Nora Ephron | 3 | 0 |
| Penny Marshall | 0 | 0 |
| Sian Heder | 1 | 1 |
| Steven Spielberg | 19 | 3 |
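The string cleaning and join that these two tables are built for can be sketched as follows. This is a minimal illustration, not the tutorial's own solution: it copies three rows from each table above, assumes the dplyr package is installed, and uses the illustrative object names `movies`, `directors`, and `movies_joined`, which do not come from dsbox.

```r
library(dplyr)

# Three rows copied from the movie table above (a subset, for brevity).
movies <- data.frame(
  title    = c("CODA", "Selma", "Parasite"),
  director = c("Sian Heder", "Ava DuVernay", "Bong Joon Ho"),
  gross    = c("$1.61M", "$66.79M", "$262.69M")
)

# The matching rows copied from the director table above.
directors <- data.frame(
  director    = c("Ava DuVernay", "Bong Joon Ho", "Sian Heder"),
  nominations = c(1, 3, 1),
  oscars      = c(0, 3, 1)
)

movies_joined <- movies |>
  # Column-wise string operation: drop "$" and "M", then convert to numeric.
  mutate(gross_clean = as.numeric(gsub("[$M]", "", gross))) |>
  # Keep every movie and attach its director's nominations and wins.
  left_join(directors, by = "director")

movies_joined
```

A left join keeps every row of the movie table, so a movie whose director is missing from the director table would still appear, with `NA` in the `nominations` and `oscars` columns.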
dsbox package
Growing interest in DS requires scalability
Data Science in a Box project
Turning it into dsbox
2 key packages: learnr and gradethis.
learnr: robust, broad framework.
gradethis: sophisticated testing logic.
9 existing, 1 started
Modifying for interactive tutorial
```{r common-themes, exercise = TRUE}
lego_sales |>
  ___(___)
```
```{r common-themes-hint-1}
# Look at the previous question for help!
```
```{r common-themes-solution}
lego_sales |>
  count(theme, sort = TRUE)
```
```{r common-themes-check}
grade_this({
  # pass() and fail() end grading as soon as they are called,
  # so the first matching condition decides the feedback.
  if (identical(as.character(.result[1, 1]), "Star Wars")) {
    pass("You have counted themes and sorted the counts correctly.")
  }
  if (identical(as.character(.result[1, 1]), "Advanced Models ")) {
    fail("Did you forget to sort the counts in descending order?")
  }
  if (identical(as.character(.result[1, 1]), "Classic")) {
    fail("Did you accidentally sort the counts in ascending order?")
  }
  if (identical(as.character(.result[1, 1]), "Adventure Camp")) {
    fail("Did you count subthemes instead of themes?")
  }
  if (identical(as.numeric(.result[1, 2]), 172)) {
    fail("Did you count subthemes instead of themes?")
  }
  # Fallback when none of the anticipated mistakes matches.
  fail("Not quite. Take a peek at the hint!")
})
```
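To see what the check chunk inspects, the solution pipeline can be run outside learnr on a small stand-in data frame. This is only a sketch: `toy_sales` is hypothetical data invented for illustration, not the real `lego_sales` dataset, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data; NOT the real lego_sales dataset.
toy_sales <- data.frame(
  theme = c("Star Wars", "City", "Star Wars", "Friends", "City", "Star Wars")
)

# Same pipeline as the solution chunk: count themes, most common first.
toy_counts <- toy_sales |>
  count(theme, sort = TRUE)

# Row 1, column 1 holds the most common theme. This is the cell
# the check chunk reads through .result[1, 1].
as.character(toy_counts[1, 1])
```

Because `sort = TRUE` puts the largest count first, checking only the first row is enough to tell a correctly sorted result apart from an unsorted or ascending one.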
Comprehensive R Archive Network
Package DESCRIPTION file
gradethis still in development
Advanced computing
Interacting with others’ code
“Teaching material is only way to master it”
New appreciation for existing educational materials
Inspired me to continue interacting with the world of open source software
Browse at your own pace at https://evandragich.github.io/thesis-work/
Email me at emd48@duke.edu